A METHOD AND SYSTEM FOR PREDICTING AMINO ACID
SEQUENCES COMPATIBLE WITH A SPECIFIED THREE DIMENSIONAL
STRUCTURE



FIELD OF THE INVENTION 

This invention relates to the field of protein design and more particularly to the
field of inverse protein folding for de novo protein design.

PRIOR ART 

The following is a list of prior art which is considered to be pertinent for
describing the state of the art in the field of the invention. Acknowledgement of these
references herein will be made by indicating the number from their list below within
brackets.
12. Desmet J. et al. Nature 356:539-542 (1992).

13. Dahiyat B.I. & Mayo S.L. Protein Sci. 5:895-903 (1997).

14. Dahiyat B.I. & Mayo S.L. Proc. Natl. Acad. Sci. USA 94:10172-10177 (1997).

15. Malakauskas S.M. and Mayo S.L. Nat. Struct. Biol. 5:470-475 (1998).



BACKGROUND OF THE INVENTION 

Depending on the primary structure and the environment, proteins fold into a
three-dimensional (3D) structure containing recurring motifs which pack together to
form the 3D structure, the most common motifs observed being the α-helix, the β-turn,
and parallel and anti-parallel β-sheets.

The 3D structure of a protein may be characterized as having internal surfaces,
being the areas buried within the structure and thus directed away from the aqueous
environment in which the protein is normally found; external surfaces, being the areas
exposed to the aqueous environment; and intermediate or boundary surfaces. Through
the study of many natural proteins, researchers have discovered that hydrophobic
residues are most frequently found on the internal surfaces of water soluble protein
molecules, while hydrophilic residues are most frequently found on the external protein
surfaces.

It was established that while the biological properties of a protein depend
directly on the protein's 3D conformation, only some of the information in the protein's
sequence is necessary to specify its fold, i.e. a given native structure may be formed
from many different sequences [Lau K.F. and Dill K.A. PNAS USA 87:638-652
(1990)]. The different sequences compatible with a given 3D structure are referred to
as the structure's Sequence Space. The finding that a number of amino acid sequences
may fold into the same basic 3D structure has focused attention on a new field
commonly referred to as "inverse protein folding" or "de novo protein design".
While conventional protein folding methods try to predict the tertiary structure
of a protein from its amino acid sequence, protein design methods look for a
sequence that will stabilize a given fold, using the same principles.
Reports of experimentally predicted amino acid sequences which adopt an
intended fold and possess physical properties similar at least in part to those of natural
proteins are appearing with increasing frequency [Kortemme T. et al. Science
281:253-256 (1998); Kuroda Y. et al. J. Mol. Biol. 236:862-868 (1994); Quinn T.P. et
al. PNAS USA 91:8747-8751 (1994); Fezoui Y. et al. PNAS USA 91:3675-3679 (1994);
Betz S.F. et al. Curr. Opin. Struct. Biol. 5:457-463 (1995); Raleigh D.P. et al. J. Am.
Chem. Soc. 117:7558-7559 (1995); Regan L. & DeGrado W.F. Science 241:976-978
(1988); Hecht M.H. et al. Science 249:884-891; Beauregard M. et al. Protein Eng.
4:745-749 (1991); Kamtekar S. et al. Science 262:1680-1685 (1993)]. These studies
have been predominantly experimental and rely on knowledge of the physical
properties that determine the protein's structure, such as the patterns of hydrophobic
and hydrophilic residues in the sequence.

Several groups have applied experimentally tested, systematic, quantitative
methods to protein design with the goal of developing general design algorithms.
Desjarlais and Handel were the first to experimentally investigate predictions
generated by genetic algorithms (GA). They developed ROC ("Repacking of
Cores"), a computational program that attempts to find novel core sequences given the
backbone structure of the protein of interest. In different, though related, work, a
modification of ROC was used on the secondary structure of the α/β protein
ubiquitin. The program used a genetic algorithm to optimize the search for
alternative core structures for a given protein. Other experimentally tested methods
applied to protein design are described elsewhere. Thus, in some
cases, uniquely folded and even functional globular proteins may be obtained using
highly simplified, minimally designed cores. The algorithms consider the spatial
positioning and steric complementarity of side chains by explicitly modeling the atoms of
sequences under consideration. However, despite the success of these studies, a full
predictive understanding of hydrophobic core packing in proteins has not yet been
realized, and de novo design of stable and unique proteins remains a challenging
problem.
A major breakthrough was achieved with the Dead-End Elimination (DEE)
algorithm of Desmet et al. [12], which was originally developed for homology modeling.
DEE finds and eliminates rotamers that are mathematically provable to be inconsistent
(or dead-ending) with the global minimum energy solution of the system.

Dahiyat and Mayo adapted the algorithm by Desmet for the [...]
solvent-exposed surface, and the boundary between core and surface.

SUMMARY OF THE INVENTION 

In accordance with a first of its aspects, the present invention relates to a
method for predicting an amino acid sequence compatible with a predefined
three-dimensional (3D) structure, the method comprising the steps of:

a) providing a coordinate set representing the backbone of said 3D structure;

b) constructing a reduced virtual representation for the 3D structure provided in
   step (a);

c) determining for each position along the virtual structure representation provided
   in step (b) its solvent accessibility;

d) constructing an initial amino acid sequence by assigning to each position along the
   sequence an amino acid residue selected randomly from a predefined group of
   amino acids having a solvent accessibility (SA) compatible with the solvent
   accessibility determined for each position;

e) randomly selecting one or more positions along the sequence provided in step
   (d) and applying on each position a Monte Carlo simulation in sequence space
   and rotamer space, said simulation comprising one or more scoring function
   calculating steps which include:

   i) randomly selecting one or more amino acid residues of the same solvent
      accessibility as that defined for said position to provide a mutation;

   ii) calculating an energy scoring function for each possible rotamer of each
      amino acid residue provided in step (i) based on their said reduced virtual
      representation;

   iii) selecting the lowest scoring rotamer or, when more than one amino acid is
      mutated simultaneously, selecting the lowest scoring rotamer
      combination;

   iv) determining whether to accept or reject the mutation with the rotamer or
      rotamer combination selected in step (iii), by applying, for example, the
      Metropolis algorithm; and

   v) assigning the amino acid residue or residues and their respective selected
      rotamer or rotamer combinations selected in step (iii) to said position/s and
      moving to another position along the sequence;

   said simulation steps are repeated until, for each position along said sequence,
   the residue and residue's rotamer with the lowest score is selected, to obtain a virtually
   represented amino acid sequence with the lowest total score;

f) expanding the reduced representation of the amino acid sequence obtained in
   step (e) to its corresponding all-atom sequence representation, thereby obtaining
   an amino acid sequence compatible with said predefined 3D structure; and

g) optionally, creating a computer output of the expanded all-atom representation
   of the amino acid sequence obtained in step (f).

According to a second aspect, the invention provides amino acid sequences
which fold into predefined 3D structures, the amino acid sequences being obtained by
the method of the present invention.






Furthermore, the invention provides, in accordance with another of its aspects, a
computer-based system for predicting an amino acid sequence compatible with a
predefined 3D structure, comprising a computer device equipped with: (a) input apparatus,
such as a keyboard, for specifying said 3D structure; (b) a first memory for [...];
(c) a second memory for storing a program which, when run, [...]
compatible with the specified 3D structure; (d) a third memory for [...]
amino acid sequence obtained; (e) a processor coupled to said input means and to said
first, second and third memories for [...]; and [...] display [...]
amino acid sequence.

[...] and/or available on diskette, CD or tape, which is to be downloaded onto the
first memory module. Thus, the term "input apparatus" signifies also any suitable
means for connecting to a network and retrieving the available [...],
thereby [...] the desired 3D structure [...]
enabling retrieving such sequences from computer readable medium, e.g. [...]

[...] which, when running on the computer device, enables the processing of the stored data
so as to provide an amino acid sequence which substantially folds into a desired 3D
structure, i.e. that specified in step (a) of the method of the invention. Such a computer
device includes a private computer (PC, either [...] or Linux OS),
workstation computers (UNIX), a computer cluster or super-computers.
BRIEF DESCRIPTION OF THE DRAWINGS

In order to understand the invention and to see how it may be carried out in
practice, some non-limiting examples will now be described, with reference to the
accompanying drawings, in which:
Fig. 1 shows the energy profile obtained for the Zif268 backbone by the system
of the invention, during 10^[...] iterations, at three temperatures: 1000K (continuous line),
500K (broken line) and 100K (dotted line). The temperature remained constant during
the simulation. The energy of the initial random sequence, before the simulation started,
was +204 kcal/mol.

Fig. 2 shows the energy profile obtained for the Zif268 backbone by the system
of the invention during 10^[...] iterations, at three maximal temperatures: 1000K
(continuous line), 500K (broken line) and 100K (dotted line), using an annealing
temperature profile with a periodicity of 500 Monte Carlo steps, during which the
temperature is gradually decreased from its initial value to zero and then set again
to its initial value for another cycle. The energy of the initial random sequence, before
the simulation started, was +204 kcal/mol.

Fig. 3 shows the energies of the 20 lowest sequences generated by the algorithm
at different simulation lengths and different temperatures, using an annealing
temperature profile with a periodicity of 500 Monte Carlo steps.

Figs. 4A-4C show the 3D crystallographic structure of Zif268
(Fig. 4A) compared with the 3D structure of the designed proteins A and B (Figs. 4B
and 4C, respectively).

Figs. 5A-5C show the 3D crystallographic structure of Zif268 (Fig. 5A)
compared with the 3D structure of the designed proteins A and B (Figs.
5B and 5C, respectively), after minimization of their side chains, displayed by spheres
sized to the van der Waals radii of the atoms (not including hydrogens).

Fig. 6 shows a diagram of Gβ1 solvent accessibility, according to the present
invention's methodology (black columns) and according to D&M (gray columns).

Figs. 7A-7B show the 3D structure of Gβ1 overlaid on that of the designed
sequence C from two different angles (Figs. 7A and 7B).
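
The annealing temperature profile described for Figs. 2 and 3 can be sketched as a sawtooth function of the Monte Carlo step. This is a minimal illustration only: the function name is hypothetical, and the linear shape of the decrease is an assumption, since the text says only that the temperature is "gradually decreased" from its initial value to zero and then reset.

```python
def annealing_temperature(step, initial_temperature, period=500):
    """Sawtooth annealing profile (sketch).

    Within each cycle of `period` Monte Carlo steps the temperature falls
    from `initial_temperature` toward zero, then jumps back for the next cycle.
    The linear decrease is an assumption; the text only says "gradually".
    """
    phase = step % period  # position inside the current cycle
    return initial_temperature * (1.0 - phase / period)


if __name__ == "__main__":
    for step in (0, 250, 499, 500):
        print(step, annealing_temperature(step, 1000.0))
```

With discrete steps the temperature approaches, but does not reach, zero at the last step of a cycle, which avoids a division by zero in a Boltzmann factor.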
DETAILED DESCRIPTION OF THE INVENTION

The present invention relates in general to a method of predicting one or more
amino acids compatible with a predefined 3D structure. The predefined 3D structure
may be that of a native protein, a polypeptide, a biologically functional derivative or
fraction of the native protein or polypeptide, or any other biologically functional
polymer for which determination of the lowest free-energy structure is desirable.

In the current disclosure, above and below, the terms "amino acid sequence",
"primary sequence" and other similar terms or derivations thereof may be used
interchangeably. These terms, as used herein, refer to an amino acid sequence of a
protein or polypeptide. The primary structure of a protein or polypeptide is the amino
acid sequence, wherein the locations of disulfide bridges, if any exist, are indicated. The
primary structure is thus a complete description of the covalent connections within the
polymer.

The term "amino acid" as used herein above and below means any organic
compound possessing one or more amino groups and one or more carboxyl groups.
Such amino acids may be naturally occurring L-amino acids, their corresponding
D-isomers, synthetic amino acids, or any other variations of the same. Within this
context, the term variant should be understood as including all possible modifications
of the naturally occurring or synthetic amino acids, including deletions, insertions and
substitutions of group/s therein. The term "amino acid residue" should be
understood as an amino acid, as defined above, which forms part of a chain, the chain
consisting of two or more amino acid units.

The coordinate set, including the dihedral angles and specific bonds within the
predefined 3D structure, may be obtained from any suitable databank known to those
versed in the art, such as the Protein Data Bank (PDB, supported by the RCSB
consortium), and is preferably provided in a computer readable form to enable its easy
input into the system of the invention. Alternatively, the 3D structure may be defined at
will, without relying on any known 3D structure of any specific protein. Such novel 3D
structures will agree with the general structure constraints of polypeptides, such as
backbone geometries, as known to those versed in the art.

According to the method of the invention, a reduced virtual representation is first
constructed for the predefined 3D structure. The reduced representation may be obtained
by the methodology originally developed by Herzyk and Hubbard for use with dynamic
simulated annealing [Herzyk P. and Hubbard R.E. Proteins 17:310-324 (1993)].
According to this methodology, the amino acids are represented by virtual spherical
atoms, wherein the main chain of the protein, polypeptide or any other suitable polymer
is represented by one virtual atom per residue located at the Cα position and the side
chains are represented by one or more additional virtual atoms. The number of
additional virtual atoms depends on the size and chemical composition of the specific
side chain.

Typically, one additional virtual atom will represent amino acid residues having
only a β side chain heavy atom or β and γ side chain heavy atoms, e.g. serine (Ser, S),
threonine (Thr, T), alanine (Ala, A), valine (Val, V), cysteine (Cys, C). Proline will
also form part of this group as its Cδ heavy atom is very close to its Cα and Cβ
atoms. Two additional virtual atoms will represent amino acid residues having γ and
δ side chain heavy atoms, β being represented by one virtual atom and γ and δ
together by another virtual atom. It should be noted that residues represented with two
additional side chain virtual atoms, e.g. histidine (His, H), aspartic acid (Asp, D),
asparagine (Asn, N), tyrosine (Tyr, Y), leucine (Leu, L), isoleucine (Ile, I),
phenylalanine (Phe, F) and methionine (Met, M), exhibit rotational flexibility around
the Cβ-Cγ bond. Three additional virtual atoms will represent amino acids such as
lysine (Lys, K), arginine (Arg, R), glutamic acid (Glu, E), glutamine (Gln, Q) and
tryptophan (Trp, W). Evidently, amino acids other than the naturally occurring amino
acids (e.g. chemical modifications or synthetic variations thereof) may be represented
by virtual atoms in a similar manner.
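
The grouping above can be written down as a simple lookup table. The following is an illustrative sketch only: the names `SIDE_CHAIN_VIRTUAL_ATOMS`, `virtual_atom_count` and `needs_rotamer_search` are hypothetical, and the entry for glycine (no side-chain virtual atom) is an assumption, since the text does not list glycine in any group.

```python
# Side-chain virtual atoms per residue, following the grouping in the text.
SIDE_CHAIN_VIRTUAL_ATOMS = {
    # one additional virtual atom
    "Ser": 1, "Thr": 1, "Ala": 1, "Val": 1, "Cys": 1, "Pro": 1,
    # two additional virtual atoms (flexible around the C-beta/C-gamma bond)
    "His": 2, "Asp": 2, "Asn": 2, "Tyr": 2,
    "Leu": 2, "Ile": 2, "Phe": 2, "Met": 2,
    # three additional virtual atoms
    "Lys": 3, "Arg": 3, "Glu": 3, "Gln": 3, "Trp": 3,
    # ASSUMPTION: glycine has no side chain, so no additional virtual atom
    "Gly": 0,
}


def virtual_atom_count(residue):
    """Total virtual atoms: one at the C-alpha position plus the side-chain atoms."""
    return 1 + SIDE_CHAIN_VIRTUAL_ATOMS[residue]


def needs_rotamer_search(residue):
    """True when the residue has more than one side-chain virtual atom."""
    return SIDE_CHAIN_VIRTUAL_ATOMS[residue] > 1


if __name__ == "__main__":
    for name in ("Ala", "Leu", "Lys"):
        print(name, virtual_atom_count(name), needs_rotamer_search(name))
```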

After constructing the reduced representation for the 3D structure provided, the
extent of solvent accessibility at each position along the 3D structure is determined. In
principle, solvent accessibility (SA) is a feature assigned to each position along the
chain of the folded protein or polypeptide. Each position is categorized as being either
buried within the 3D structure (in an internal surface), exposed (part of an external
surface) or within a boundary surface (intermediate position).

According to one preferred embodiment of the invention, the SA is determined
by surrounding the reduced representation of the protein with a grid and calculating the
number of grid points that fall into the intersection volume of the volume of every
virtual atom and the volume of its neighbor virtual atom (the volume determined
according to the adequate van der Waals radius [Bernstein F.C. et al. J. Mol. Biol.
112:535-542 (1977)]). However, it should be clear that other ways of determining an
amino acid's SA can be employed, as may be known to the man versed in the art.
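
A minimal sketch of the grid-point counting idea follows. It is not the implementation of the invention: the function names, the grid spacing, and the use of a per-atom bounding-box grid (rather than one grid around the whole protein) are assumptions made for brevity. The text also does not state how the resulting count is thresholded into buried, intermediate and exposed classes, so no thresholds are given here.

```python
import itertools
import math


def intersection_grid_points(center_a, radius_a, center_b, radius_b, spacing=0.5):
    """Count grid points that lie inside both spheres a and b.

    The grid covers the bounding box of sphere a, which is sufficient because
    any point in the intersection must lie inside a. Spacing is an assumption.
    """
    low = [c - radius_a for c in center_a]
    steps = int(math.ceil(2 * radius_a / spacing)) + 1
    count = 0
    for i, j, k in itertools.product(range(steps), repeat=3):
        point = (low[0] + i * spacing, low[1] + j * spacing, low[2] + k * spacing)
        inside_a = math.dist(point, center_a) <= radius_a
        inside_b = math.dist(point, center_b) <= radius_b
        if inside_a and inside_b:
            count += 1
    return count


def burial_count(index, centers, radii, spacing=0.5):
    """Sum the intersection counts of one virtual atom with all other virtual atoms."""
    return sum(
        intersection_grid_points(centers[index], radii[index],
                                 centers[j], radii[j], spacing)
        for j in range(len(centers))
        if j != index
    )


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (10.0, 0.0, 0.0)]
    radii = [1.0, 1.0, 1.0]
    print([burial_count(i, centers, radii) for i in range(3)])
```

A larger count indicates that more of the atom's volume is shared with its neighbors, i.e. that the position is more buried.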

Each type of position, i.e. buried, exposed or intermediate, may be occupied by
several amino acid residues. Hydrophobic amino acids, being able to form a
hydrophobic core, are assigned to the buried positions of the 3D structure, while
hydrophilic amino acids are assigned to the solvent-exposed positions. Boundary
positions, between those two environments, can be occupied by both types of amino
acids.

According to one particular embodiment of the invention, the buried positions
may be occupied by amino acids selected from the group consisting of Ala, Tyr, Trp,
Val, Leu, Ile, Phe, Met, Cys, Pro, Gly and variants thereof, all of which are
hydrophobic in nature.

The exposed positions may be occupied by amino acid residues selected from the
group consisting of Lys, Arg, His, Glu, Asp, Gln, Asn, Ser, Thr and variants thereof, all of
which are hydrophilic in nature.

The positions assigned an intermediate level of SA may be occupied by
all types of amino acids, particularly those which serve in nature as building blocks for
proteins, i.e. Pro, Lys, Arg, His, Glu, Asp, Gln, Asn, Ser, Thr, Gly, Ala, Tyr, Trp, Val,
Leu, Ile, Phe, Met, Cys and variants thereof.
It should be noted, however, that the above classification of the buried,
intermediate and exposed residues is only one example of classification and these
groups may be changed.

Special assignment of patterns may be set for particular positions in the
protein's 3D structure that deviate from the general assignment based on SA. Such
assignment may be introduced, for example, to preserve buried salt bridges.

After assigning each position its characteristic SA: (1) an amino acid for
every Cα position is selected randomly, taking into account the solvent accessibility of
that position (buried, exposed or intermediate). Alternatively, this selection may be
applied only to a sub-set of the polypeptide's amino acid residues, leaving the other
positions fixed throughout the design process; (2) the appropriate bonds and angles of
the protein's reduced representation are assigned based on the coordinate set provided;
and (3) one of the physically permissible rotamers that characterizes each of the amino
acids is assigned.
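
Step (1) above, the random choice of an initial residue at each position from the group matching its solvent accessibility, can be sketched as follows. The residue groups are those of the particular embodiment described in the text; the function name `initial_sequence`, the string labels for the SA classes, and the `fixed` argument (for positions held constant throughout the design) are hypothetical.

```python
import random

# Residue groups from the particular embodiment described in the text.
BURIED = ["Ala", "Tyr", "Trp", "Val", "Leu", "Ile", "Phe", "Met", "Cys", "Pro", "Gly"]
EXPOSED = ["Lys", "Arg", "His", "Glu", "Asp", "Gln", "Asn", "Ser", "Thr"]
INTERMEDIATE = ["Pro", "Lys", "Arg", "His", "Glu", "Asp", "Gln", "Asn", "Ser", "Thr",
                "Gly", "Ala", "Tyr", "Trp", "Val", "Leu", "Ile", "Phe", "Met", "Cys"]

ALLOWED = {"buried": BURIED, "exposed": EXPOSED, "intermediate": INTERMEDIATE}


def initial_sequence(sa_classes, fixed=None, rng=random):
    """Pick a random residue per position from the group matching its SA class.

    `fixed` maps position index -> residue for positions that are not designed.
    """
    fixed = fixed or {}
    sequence = []
    for position, sa_class in enumerate(sa_classes):
        if position in fixed:
            sequence.append(fixed[position])
        else:
            sequence.append(rng.choice(ALLOWED[sa_class]))
    return sequence


if __name__ == "__main__":
    classes = ["buried", "exposed", "intermediate", "exposed"]
    print(initial_sequence(classes, fixed={3: "Gly"}, rng=random.Random(0)))
```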

A scoring function is applied to evaluate the effect of changing the amino acid
sequence and residue rotamer for a given 3D backbone structure. The scoring function
used according to the present invention has two main contributions: a residue-residue
interaction term and a residue's secondary-structure propensity term. The interaction
part of the scoring function is based in part on a Lennard-Jones-like potential function,
multiplied by the effective attractive inter-residue contact energies (εij). The energy is
summed over all possible pairs of residues in the protein, where the energy between
each pair of residues is a sum of the interaction energies between all possible pairs of
virtual atoms, except for the energy between two neighboring Cα virtual atoms, which do
not contribute to the interaction. Since each side chain is represented by one or more
virtual atoms, the energy contribution of each residue-residue pair is divided among the
virtual atoms such that the sum over energy contributions of the virtual atom pairs (one
from each residue) equals the effective residue-residue interaction. The Lennard-Jones
potential function is modified to make the effect of repulsion smaller (because the
virtual atoms are 'softer' than real atoms).
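
The structure of the interaction term can be sketched as below. This is a hedged illustration, not the scoring function of the invention: the text does not give the functional form of the softened potential, so the 8-4 exponents, the single contact distance `r0`, the equal division of the contact energy among virtual-atom pairs, and the reading of "neighboring" Cα atoms as sequence-adjacent residues are all assumptions. All function names are hypothetical.

```python
import math


def soft_lj(r, r0, rep_exp=8, att_exp=4):
    """Lennard-Jones-like well with minimum value -1 at r = r0.

    ASSUMPTION: exponents lower than the usual 12-6 are used here to make the
    repulsion softer; the text does not specify the modified form.
    """
    x = r0 / r
    return (att_exp * x ** rep_exp - rep_exp * x ** att_exp) / (rep_exp - att_exp)


def residue_pair_energy(atoms_i, atoms_j, eps_ij, adjacent, r0=5.0):
    """Interaction of two residues; each is a list of virtual-atom centres,
    with index 0 being the C-alpha virtual atom."""
    pairs = [
        (a, b)
        for ia, a in enumerate(atoms_i)
        for ib, b in enumerate(atoms_j)
        # neighbouring C-alpha atoms do not contribute
        if not (adjacent and ia == 0 and ib == 0)
    ]
    if not pairs:
        return 0.0
    share = eps_ij / len(pairs)  # ASSUMPTION: contact energy split equally
    return sum(share * soft_lj(math.dist(a, b), r0) for a, b in pairs)


def total_interaction_energy(residues, eps):
    """Sum over all residue pairs; eps[i][j] is the contact-energy magnitude."""
    total = 0.0
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            total += residue_pair_energy(residues[i], residues[j],
                                         eps[i][j], adjacent=(j == i + 1))
    return total
```

Because the shares sum to `eps_ij`, a residue pair whose virtual atoms all sit at the optimal distance contributes exactly `-eps_ij`, which matches the requirement that the virtual-atom contributions add up to the effective residue-residue interaction.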




The contact energies are those calculated by Miyazawa and Jernigan [Miyazawa S. and
Jernigan R.L. J. Mol. Biol. 256:623-644 (1996)]. The basic assumption on which the
contact energies are based is that [...] observed in a large number of crystal structures
of globular proteins [...] the actual intrinsic inter-residue contacts of proteins.

Secondary structure propensities are also included in the scoring function.
To this end, the total energy of the protein is calculated by adding a term [...]
by Bahar et al. [Bahar I. et al. Proteins 29:292-308 (1997)]. These [...]
are added only if the residue is situated in a helical or a [...]
the 3D backbone template, according to the secondary structure of the designed protein
or polypeptide.

The scoring function is applied as part of a Monte Carlo simulation which
combines a search in the sequence space for amino acid residues and in the rotamer
space of each residue. This process provides the system with the optimal sequence
for a given backbone. The term "optimal sequence" refers to a
sequence compatible with the predefined 3D structure and having the lowest total
score.



The term "sequence space" refers to the total number of possible
sequences for a given number of different residues and a given number of [...]
protein, polypeptide or any other appropriate polymer, e.g. for a [...]
composed of 20 different amino acids, the sequence space will contain 20^[...] possible
sequences. The term "rotamer space" refers to the total number of physically
permissible conformations for a residue in a given amino acid sequence.

The advantage of the combined reduced representation of the side chains and the
grouping of amino acids and structure sites according to the solvent accessibility relies on
the high efficiency of searching through both sequence space and rotamer space. The
combined simplifications dramatically reduce the search space, while retaining a
physically reasonable representation that can accurately account for rotamer flexibility.
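
The effect of the solvent-accessibility grouping on the size of sequence space can be illustrated with a short calculation. The group sizes (11 buried, 9 exposed, 20 intermediate residue types) follow the particular embodiment described earlier; the function names and the example pattern of positions are hypothetical and chosen only for illustration.

```python
import math

# Number of residue types allowed per SA class in the embodiment described above.
GROUP_SIZE = {"buried": 11, "exposed": 9, "intermediate": 20}


def unrestricted_sequence_space(n_positions, alphabet=20):
    """Every position may hold any of the 20 amino acids."""
    return alphabet ** n_positions


def restricted_sequence_space(sa_classes):
    """Each position may hold only residues of its own SA class."""
    return math.prod(GROUP_SIZE[sa_class] for sa_class in sa_classes)


if __name__ == "__main__":
    # Hypothetical 10-residue pattern, for illustration only.
    pattern = ["buried"] * 4 + ["exposed"] * 4 + ["intermediate"] * 2
    full = unrestricted_sequence_space(len(pattern))
    reduced = restricted_sequence_space(pattern)
    print(full, reduced, full / reduced)
```

The reduction grows multiplicatively with the number of buried and exposed positions, since each such position removes roughly half of the candidate residue types.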

The search in sequence space begins when up to three positions along the protein
are simultaneously randomly selected and replaced with different amino acids (each
replacement referred to as a 'mutant' in that specific 'trial configuration'). The replacing
amino acids are selected randomly from the group of amino acids having the same
characteristics (buried, exposed or intermediate) as defined for the specific replaced
positions to form the 'mutant'. If any of the new mutation residues has more than one
virtual side chain atom, a search in rotamer space begins by calculating the total energy
score of the new sequence for each and every allowed rotamer or rotamer combination
of the mutated amino acids (not all rotamers are allowed, as described by Ponder and
Richards [Ponder J.W. and Richards F.M. J. Mol. Biol. 193(4):775-791 (1987)]). The
energy score difference ΔE between the lowest energy score of the trial configuration,
being the lowest energy score among all allowed rotamers of the new mutant (or mutants,
if more than one amino acid is replaced), and the energy of the last accepted
configuration is calculated. The Metropolis algorithm [Metropolis N. & Ulam S. J. Am.
Stat. Assoc. 44:335-341 (1949)] is used to determine whether the new trial sequence is
accepted or rejected. If ΔE is negative, the mutation is accepted with the best rotamers;
otherwise, the trial configuration is accepted at a probability determined according to the
Boltzmann distribution, exp(-ΔE/kT) (T being either a fixed or a varying annealing
temperature as will be described hereinbelow).
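
The acceptance rule just described can be sketched in a few lines. The function name is hypothetical, and the use of the Boltzmann constant in kcal/(mol K) is an assumption based on the energies in the figure descriptions being quoted in kcal/mol; the source does not state the units of the score.

```python
import math
import random

# ASSUMPTION: scores are treated as energies in kcal/mol, so k is in kcal/(mol*K).
K_B = 0.0019872041


def metropolis_accept(delta_e, temperature, rng=random):
    """Decide whether a trial configuration is accepted.

    delta_e is the score of the trial configuration (with its best rotamers)
    minus the score of the last accepted configuration.
    """
    if delta_e < 0:
        return True  # improvements are always accepted
    if temperature <= 0:
        return False  # at zero temperature no uphill move is possible
    return rng.random() < math.exp(-delta_e / (K_B * temperature))


if __name__ == "__main__":
    rng = random.Random(1)
    accepted = sum(metropolis_accept(1.0, 1000.0, rng) for _ in range(10000))
    print(accepted / 10000, math.exp(-1.0 / (K_B * 1000.0)))
```

Running the example compares the observed acceptance rate for an uphill move of 1 kcal/mol at 1000K with the Boltzmann factor it should approximate.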

The search continues through a large number of trials (steps) in order to allow the
score to decrease and converge. This number depends on the size of the protein and on
the number of residues that are allowed to be mutated (in case the design is just of a
certain part of the protein).

It is possible to perform multiple mutations in each simulation step, with
adjustable probabilities determining whether one or more mutations take place at any
given trial step. If two or more mutations take place simultaneously, the minimal energy
score is searched among all rotamer combinations of those mutated residues in the new
sequence that have side chains of two or more virtual atoms.
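
A minimal sketch of the exhaustive search over rotamer combinations for simultaneous mutations follows. The function name and the representation of rotamers as integer identifiers are hypothetical, and `score_of` stands in for whatever scoring function is used; it is an assumed callable, not an interface from the source.

```python
import itertools


def best_rotamer_combination(rotamers_per_position, score_of):
    """Return the lowest-scoring rotamer assignment and its score.

    rotamers_per_position: dict mapping a mutated position to its list of
        allowed rotamer identifiers.
    score_of: callable taking a dict {position: rotamer} and returning the
        total score of the trial sequence with those rotamers.
    """
    positions = list(rotamers_per_position)
    best_assignment, best_score = None, float("inf")
    # Enumerate every combination of allowed rotamers over the mutated positions.
    for combo in itertools.product(*(rotamers_per_position[p] for p in positions)):
        assignment = dict(zip(positions, combo))
        score = score_of(assignment)
        if score < best_score:
            best_assignment, best_score = assignment, score
    return best_assignment, best_score


if __name__ == "__main__":
    allowed = {3: [0, 1, 2], 7: [0, 1]}
    toy_score = lambda a: (a[3] - 1) ** 2 + (a[7] - 1) ** 2  # toy score only
    print(best_rotamer_combination(allowed, toy_score))
```

With at most three simultaneous mutations and a handful of allowed rotamers per residue, the number of combinations stays small enough for this brute-force enumeration.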

Throughout the simulation, the lowest "scored" sequences are selected. The final
optimal sequence is the one associated with the lowest total energy score found during
the optimization process. The resulting sequence is then expanded to its corresponding
3D all-atom representation (as opposed to the virtual representation). This all-atom
representation may then be either saved in a computer readable form or extracted in the
form of a computer output. Also collected are additional low energy score sequences, in
order to enable analyzing relative consistency patterns of residues in a given position.
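
Collecting the lowest-scored sequences during the run can be sketched with a bounded heap. The class name and methods are hypothetical; the default capacity of 20 is taken from the description of Fig. 3, and skipping duplicate sequences is an assumption about how the collection is kept.

```python
import heapq


class LowestScores:
    """Keep the `capacity` lowest-scored distinct sequences seen so far."""

    def __init__(self, capacity=20):
        self.capacity = capacity
        self._heap = []     # max-heap on score, implemented by negating scores
        self._seen = set()  # sequences currently stored

    def offer(self, score, sequence):
        key = tuple(sequence)
        if key in self._seen:
            return  # ASSUMPTION: duplicates are stored only once
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (-score, key))
            self._seen.add(key)
        elif score < -self._heap[0][0]:
            # Replace the current worst entry with the better newcomer.
            _, dropped = heapq.heapreplace(self._heap, (-score, key))
            self._seen.discard(dropped)
            self._seen.add(key)

    def best(self):
        """Stored (score, sequence) pairs, lowest score first."""
        return sorted((-neg, list(seq)) for neg, seq in self._heap)


if __name__ == "__main__":
    keep = LowestScores(capacity=2)
    for score, seq in [(5, ["Ala"]), (3, ["Leu"]), (4, ["Val"])]:
        keep.offer(score, seq)
    print(keep.best())
```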

To evaluate whether the sequence obtained is indeed compatible with the
predefined 3D structure, the all-atom 3D model that is constructed from the novel
sequence can be analyzed. There are several methods for determining whether a designed
amino acid sequence indeed folds into a predefined 3D structure. One way of performing
such an evaluation is to compare the structure of the designed protein (after standard
all-atom minimization of its side chains) with the structure of the model after molecular
dynamics simulation in water, or to compare the molecular mechanics energy of the
wild-type protein from which the 3D structure used was taken with that of the designed
protein, after molecular dynamics of the latter. Molecular dynamics programs, such as
CHARMM [Brooks B.R. et al. J. Comp. Chem. 4:187-219 (1983)], may be utilized for
this purpose, as illustrated in the following Examples.

The amino acid sequence designed by the method of the invention is a de novo 
sequence and preferably a sequence, which under physiological conditions folds 
substantially into the desired 3D structure. More preferably, the amino acid sequences 
obtained are biologically functional. 

The sequence obtained may be used for various applications. According to one
preferred embodiment, the designed amino acid sequence is chemically synthesized by
procedures known in the art.



In a further preferred embodiment, the novel amino acid sequence is used to
create a nucleic acid sequence, such as DNA, which encodes the optimal sequence. A
man versed in the art would know, based on the existing technologies, how to deduce at
least one nucleic acid which will encode the amino acid sequence designed. The
nucleic acid sequence obtained may then be cloned into a host cell and expressed. The
choice of codons, suitable expression vectors and suitable host cells may vary
depending on a number of factors, and can be easily optimized as needed.

Once made, the novel amino acid sequence may be experimentally evaluated
and tested for structure, function and stability, as required. This will be performed as is
known in the art and will depend in part on the original protein from which the
sequence's backbone structure was taken. Preferably, the designed protein will be more
stable than the known protein used as the starting point, although at times, if some
constraints are placed on the method disclosed herein, the designed sequence may be
less stable. For example, it is possible to fix certain residues for altered biological
activity and find the most stable sequence, but it may still be less stable than the wild
type protein. Stable in this context includes, but is not limited to, thermal stability,
i.e. an increase in the temperature at which reversible or irreversible denaturing starts
to occur; proteolytic stability, i.e. a decrease in the amount of protein which is
irreversibly cleaved in the presence of a particular protease (including autolysis);
stability to alteration in pH or oxidative conditions; chelator stability; stability to metal
ions; stability to solvents such as organic solvents, surfactants, formulation chemicals,
etc.

The proteins of the invention, and naturally, the nucleic acids deduced therefrom,
may be used in a variety of applications, ranging from industrial to pharmacological
uses, depending on the protein. Examples of the different uses are in biotechnology
manufacturing of therapeutic peptides and proteins, in gene therapy, design of
modified therapeutic peptides and proteins as pharmaceuticals, etc.

Another application of the invention disclosed herein may be the generation of a
library of small stable protein elements that can be later assembled in various ways to
design a sequence for a novel larger protein with a desired 3D structure. Yet further,
the method of the present invention may be applicable for optimizing the novel larger
protein obtained, thus ensuring that the peptides from which it was constructed indeed
fit the structure.

In view of the above, the invention further provides amino acid sequences
substantially compatible with a specified 3D structure, the amino acid sequences being
obtained by the method of the present invention.

Yet further, in accordance with another of its aspects, there is provided a
computer-based system for predicting an amino acid sequence compatible with a
specified 3D structure, the system comprising the constituents as defined hereinbefore
and after.

The input apparatus, such as a keyboard, employed by the system of the
invention, is used for entering a selected set of coordinates representing the
predefined 3D structure and other data such as scoring function and optimization
process parameters. The first and third memory means, being preferably RAM
(random access memory), are used for storing the initial and final data, while the second
memory means, being preferably ROM (read-only memory), is used to store the
program of the method of the present invention. Further, the system comprises a
microprocessor for performing, under control of the stored program, the steps of
processing the entered data and displaying, via a display unit or printer, the novel amino
acid sequence.

A user enters the coordinate set for the predefined 3D structure from an
optional, auxiliary storage unit. In response to entry of the coordinate set, the system
inputs the data for processing, stores the data in memory, then processes it as described.
The data provided regarding the 3D structures is typically retrieved from existing files
known to those versed in the art and available to them.
SPECIFIC EXAMPLES

The present invention is defined by the claims, the contents of which are to be
read as included within the disclosure of the specification, and will now be described by
way of example with reference to the accompanying Figures.

GENERAL

CHARMM minimization and molecular dynamics

The comparison between the all-atom 3D structure of the designed protein (after 
minimization of its side chains) with its structure after molecular dynamics was 
carried out in the following specific Examples using the CHARMM molecular 
program [version 29, Brooks B. R. et al. (1983)]. Further, the comparison between 
the averaged energy of the desired protein after dynamics with the energy of the 
native protein is carried out using the CHARMM force field (MacKerell A. D. et al. 
J. Phys. Chem. B 102:3586-3616 (1998)). The minimization and the molecular dynamics 
are performed when the protein is embedded in a water sphere. For native proteins the 
coordinates are based on the information provided by the PDB. The conformation of the 
desired protein is composed of the backbone conformation of the native protein and 
the side chain conformation of the new residues, according to the best rotamers 
chosen by the method of the invention. 

CHARMM applies two minimization algorithms to the protein's side chains, 
Steepest Descent (SD) and Adopted Basis Newton-Raphson (ABNR). After 
minimization of the side chains and of the water surrounding the protein, CHARMM 
performs molecular dynamics of the protein. 

Example 1 - Zif268 as a target fold 

In order to examine the method according to the invention, the ββα motif 
typified by the zinc finger DNA binding module in the zinc finger protein Zif268 was 
used. Zif268 is a well recognized protein. This protein is small enough to be both 
computationally and experimentally tractable, yet large enough to form an 
independently folded structure in the absence of disulfide bonds or metal binding. 
Although this motif consists of fewer than 30 residues, it does contain sheet, helix and 
turn structures. By the method and system of the invention the entire amino acid 
sequence was computed: the buried core, the solvent exposed surface and the boundary 
between core and surface, except for Gly27, which was not mutated during the simulation. 
The input coordinates are those of residues 33-60 of the native protein, 
obtained from the X-ray structure coordinates of the Zif268 immediate early gene (krox-24) 
complex with an 11 base pair DNA fragment (Protein Data Bank (PDB) code: 1ZAA), 
as determined at 2.1 Å resolution by Pavletich and Pabo (Pavletich N. P., Pabo C. O. Science 
252:809-817 (1991)). Recently, this protein was also analyzed by Dahiyat & Mayo. 

1.1 Solvent accessibility of Zif268 

Solvent accessibility (SA) was determined based on the coordinates of the 
native protein, and each residue along the chain was assigned a level of SA: buried, 
exposed or intermediate. Table 1 presents a comparison between the SA obtained 
according to one embodiment of the present invention and that obtained by 
Dahiyat and Mayo. 
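
By way of non-limiting illustration only, the assignment of a residue to one of the three SA levels may be sketched as a simple thresholding of its exposed fraction. The function name and both threshold values below are hypothetical and are not taken from the present disclosure; they merely show the shape of such a classification.

```python
def classify_accessibility(fraction_exposed, buried_max=0.10, exposed_min=0.50):
    """Map a residue's exposed surface fraction (0.0 to 1.0) to an SA level.

    Returns "b" (buried), "i" (intermediate) or "e" (exposed).
    buried_max and exposed_min are hypothetical cutoffs chosen for
    illustration only.
    """
    if not 0.0 <= fraction_exposed <= 1.0:
        raise ValueError("fraction_exposed must lie between 0 and 1")
    if fraction_exposed <= buried_max:
        return "b"
    if fraction_exposed >= exposed_min:
        return "e"
    return "i"


# Example: a fully buried, a partly buried and a mostly exposed residue.
levels = [classify_accessibility(f) for f in (0.0, 0.2, 0.9)]
print(levels)
```

With the hypothetical cutoffs above, a residue that is 80 percent buried (exposed fraction 0.2) falls in the intermediate class, while a fully buried residue is classified as buried.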

Table 1: Solvent Accessibility of Zif268 

The sequence and SA rows are written in groups of five residues, starting at residue 1. 

Residue No.       1-5   6-10  11-15 16-20 21-25 26-28 
Zif268(b)         KPFQC RICMR NFSRS DHLTT HIRTH TGE 
SA(c)             eeiib eeeee eieee eeiee iieei eee 
D&M's SA(d)       eeieb eieee eieee eeiee iieei eee 
2nd Struct.(a)    EEE, TT, EEE, HHHHHHHHHHHH (segments listed in chain order) 

(a) Secondary structure containing Extended sheet (E), Turn (T) or Helix (H); 
(b) Zif268 wild-type sequence written in one letter code; 
(c) Solvent accessibility as determined by the invention, categorized as buried (b), 
    exposed (e) or intermediate (i); 
(d) Solvent accessibility as determined by Dahiyat and Mayo. 
As may be seen from Table 1, there are only two differences in the Zif268 solvent 
accessibility assignments, the latter employing the Connolly algorithm [Connolly M. L. 
Science [...]] and a subsequent manual change in the SA assignment of two positions 
[...] to the exposed class. Position 4 is classified as an intermediate residue by 
the present invention's algorithm and as an exposed residue in D&M's work, and 
position 7 is classified as an exposed residue by the present invention's algorithm and 
as an intermediate in D&M's work. 

1.2 The motif optimization process: energy score profile 

A Monte Carlo (MC) search was conducted as described hereinbefore and the 
profile of the score as a function of MC trials was calculated. The results, which 
depended on the temperature of the system T, are presented in Figs. 1 and 2. 
Figure 1 presents the energy profile at three constant temperature parameters: 100K, 
500K and 1000K. Figure 2 presents the energy profile using an annealing profile of 
temperature parameters. The maximal temperatures were also 100K, 500K and 1000K 
and the periodicity was 500 Monte Carlo steps. Namely, during each cycle of 500 MC 
steps the temperature parameter is gradually reduced until it reaches zero, at which 
point the temperature parameter is set again to its initial value for a new 500 step 
annealing cycle to begin. The total size of the search space was 3.41x10^[...], but in all 
cases within less than 2000 iterations the algorithm reached the range of stable 
sequences between -270 kcal/mol and -300 kcal/mol, according to the scoring function. 

It can be seen from Figs. 1 and 2 that the optimization reaches lower scores 
when the periodic annealing temperature profile is used. This profile enables the 
program to escape local minima by accepting high-energy sequences that would not 
otherwise be accepted, but at the same time to optimise locally when the temperature 
is reduced. Among the three temperatures examined, the search reached the lowest values 
of the scoring function when the initial temperature parameter was 1000K and an 
annealing profile was used. 
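
By way of non-limiting illustration only, the periodic annealing profile described above may be sketched as follows. The sketch assumes a linear decrease of the temperature parameter within each cycle (the text states only that it is reduced gradually) and a standard Metropolis acceptance rule; the toy scoring and mutation functions, and all names used, are hypothetical and are not the scoring function of the invention.

```python
import math
import random

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K)


def annealing_temperature(step, t_max, period=500):
    """Temperature parameter for a periodic annealing profile.

    Within each cycle of `period` steps the value falls linearly (an
    assumption) from t_max to zero, then resets to t_max for the next cycle.
    """
    if period < 2:
        raise ValueError("period must be at least 2 steps")
    phase = step % period
    return t_max * (1.0 - phase / (period - 1))


def accept(delta, temperature, rng):
    """Metropolis rule: always accept downhill moves; accept uphill moves
    with probability exp(-delta / kT); never accept uphill at T = 0."""
    if delta <= 0.0:
        return True
    if temperature <= 0.0:
        return False
    return rng.random() < math.exp(-delta / (K_B * temperature))


def monte_carlo(score, propose, start, n_steps, t_max, period=500, seed=0):
    """Run the search and return the lowest-scoring state seen."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for step in range(n_steps):
        candidate = propose(current, rng)
        candidate_score = score(candidate)
        temperature = annealing_temperature(step, t_max, period)
        if accept(candidate_score - current_score, temperature, rng):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = current, current_score
    return best, best_score


# Hypothetical toy problem: the score counts mismatches against a fixed string.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "KPFQC"


def toy_score(seq):
    return float(sum(a != b for a, b in zip(seq, TARGET)))


def toy_propose(seq, rng):
    """Mutate one randomly chosen position to a random amino acid."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(ALPHABET) + seq[pos + 1:]


best, best_score = monte_carlo(toy_score, toy_propose, "AAAAA", 2000, t_max=1000.0)
print(best, best_score)
```

At the start of each cycle the high temperature parameter lets the search accept score-raising mutations, and towards the end of the cycle the rule reduces to a purely downhill local optimization, which corresponds to the behaviour described for Figure 2.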
Figure 3 shows the energies of the 20 lowest sequences generated by the 
algorithm with different simulation lengths and different temperatures, using an 
annealing temperature profile with a periodicity of 500 Monte Carlo steps. The results 
show that at 500K the algorithm converges after 10^[...] iterations, at 1000K after 
10^[...] and at 100K after 10^[...] iterations, and reached different energies each time. 

A run of 10^[...] iterations of the zinc finger protein simulation required one 
CPU hour on a single alpha processor workstation, and about 1.5 hours on a Pentium 
[...]. 

1.3 Results of motif design 

A total of 50 simulations of the program were performed, each one 
terminated after 10^[...] iterations under the same temperature conditions: an annealing 
temperature profile with an initial temperature of 1000K. Several different lengths of 
Monte Carlo step periodicity were tested and it was found that the best annealing 
periodicity for a simulation of 10^[...] iterations was 10^[...] steps. 

Each simulation began with a different random seed but with the same 3D 
backbone template. The set of 50 simulations was repeated twice, each set with a 
different solvent accessibility (SA) assignment for the protein residues. The first set 
used the present invention's automated solvent accessibility algorithm and the second 
set used Dahiyat and Mayo's (D&M) fitted assignments (see Table 1). 

Tables 2A and 2B present the lowest energy sequences obtained in the first and 
second sets (A and B respectively), aligned with the second zinc finger module of the 
DNA binding protein Zif268 and with the D&M designed sequence, FSD-1. The 
coordinates used for the FSD-1 ββα motif score evaluations are the experimental NMR 
coordinates (PDB code 1FSD), which were found by D&M. All the energy scores in 
Table 2B were calculated according to the reduced representation of amino acids and 
the scoring function of the present invention. The A and B scores were found to be 
lower than both the Zif268 score (without considering the His2Cys2 Zn-binding 
interactions, which are not included in the scoring function) and the FSD-1 score. The 
energy score of the most stable sequence, A, is -351.9 kcal/mol. This score is lower than 
the Zif268 score by 111.3 kcal/mol, which is a significant difference (not taking into 
account the Zn interactions). The relative stability of both A and B sequences in 
comparison to the FSD-1 sequence may be in part due to the fact that the FSD-1 sequence 
was designed with a different scoring function. 

Table 2A - The most stable sequence obtained for the Zinc finger 

                  Residues 1-12    Residues 13-28 
2nd struct.(a)    EEE TT EEE       HHHHHHHHHHHH 
SA(b)             eeiibeeeeeei     eeeeeieeiieeieee 
D&M's SA          eeiebeieeeei     eeeeeieeiieeieee 
FSD-1(c)          QQYTAKIKGRTF     RNEKELRDFIEKFKGR 
Zif268(d)         KPFQCRICMRNF     SRSDHLTTHIRTHTGE 
A(e)              EHMFVHHHTTRF     SSHTSFTSFLRSMQGR 
B(f)              QHMTVHFHNTTF     SHHSSLSTFLQSFQGR 

(a) Secondary structure containing Extended sheet (E), Turn (T) or Helix (H), with 
    segments listed in chain order; 
(b) Solvent accessibility as determined by the invention, categorized as buried (b), 
    exposed (e) or intermediate (i) (all other positions are exposed); 
(c) The sequence as designed by D&M; 
(d) Wild-type Zif268 sequence written in one letter code; 
(e) Sequence obtained by the present invention using the SA calculation described herein; 
(f) Sequence obtained by the present invention using D&M SA fitted assignments. 
Table 2B - Energy scores of the sequences presented in Table 2A 

Sequence                     Energy (kcal/mol) 
FSD-1(a)                     -166.5 
Wild type Zif268 (no Zn)     -240.6 
A(b)                         -351.9 
B(c)                         -348.7 

(a) The sequence designed by D&M; 
(b) The sequence obtained by the present invention using SA calculations as described; 
(c) The sequence obtained by the present invention using D&M SA fitted assignments. 




1.4 Analysis of the resulting sequence 

Statistical calculations over the 50 sequences obtained in 50 simulations (data 
not shown) provide the following observations:- 

1. Even though all of the hydrophobic amino acids were allowed at the 
intermediate positions, the algorithm selected only non-polar amino acids 
at all those locations. This agrees well with the finding that these form a 
well-packed buried cluster [Dahiyat B. I. (1997), ibid.]. 

2. For positions 5, 8, 21 and 25, the original Zn-binding amino acids (two 
cysteines (C) and two histidines (H)), the algorithm consistently selected 
residues of a well defined solvent accessibility character (even at 
"intermediate" positions). Hydrophobic amino acids (Val, Phe, Leu, Ile 
and Met) were selected for positions 5, 21 and 25, which are classified as 
either "buried" or "intermediate". In the single exposed position 
(position 8), a hydrophilic amino acid was selected (His). 

3. Positions 21 and 25 of the optimal sequences were selected to be Phe or 
Met (position 21) and Leu (position 25) side chains. In the original 
Zif268, these positions were occupied by the zinc binding His residues. 
These positions are more than 80 percent buried. Position 5, which is 100 
percent buried, was predominantly selected to be Val. The other 
boundary positions demonstrate the steric constraints on buried residues 
by packing similar side chains to those of the original Zif268 sequence. 

4. In the helix region (residues 15-26) the algorithm placed two Leu side 
chains and one Gln, which are good helix forming residues, in sequence 
B, and one Leu and one Gln in sequence A. 

5. In both A and B sequences, position 5 on the exposed sheet surface was 
selected by the algorithm to be Val, which is a very good β-sheet forming 
residue, and positions 4 and 10 (and 11 only in sequence B) were 
selected to be Thr, which is also a good β-sheet forming residue. 

6. Alignment of the optimal stable sequence (B) and Zif268 indicates that 4 
out of 27 residues (not including residue 27, which remains Gly throughout 
the simulation) are identical (15%) and 11 are similar (including the 
identical, 40.7%). D&M obtained similar values, with 5 identical residues 
(18.5%) and 12 similar (44.4%). 

7. Alignment of sequence B and FSD-1 indicates that 5 out of 27 
residues are identical between the sequences (18.5%) and 11 are similar 
(including the identical, 40.7%). 
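
By way of non-limiting illustration only, the identity count of observation 6 may be reproduced by a position-by-position comparison of the Zif268 and B sequences of Table 2A, leaving out residue 27. The function name is hypothetical; similarity is not computed because the similarity grouping of amino acids is not specified in the text.

```python
def percent_identity(seq_a, seq_b, skip_positions=()):
    """Compare two equal-length sequences position by position.

    skip_positions holds 1-based positions excluded from the comparison.
    Returns (identical, compared, percent).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    skip = set(skip_positions)
    pairs = [
        (a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if pos not in skip
    ]
    identical = sum(a == b for a, b in pairs)
    return identical, len(pairs), 100.0 * identical / len(pairs)


ZIF268 = "KPFQCRICMRNFSRSDHLTTHIRTHTGE"  # wild type, Table 2A
SEQ_B = "QHMTVHFHNTTFSHHSSLSTFLQSFQGR"   # designed sequence B, Table 2A

# Residue 27 (Gly) was never mutated, so it is excluded from the count.
identical, compared, pct = percent_identity(ZIF268, SEQ_B, skip_positions=(27,))
print(f"{identical} of {compared} identical ({pct:.1f}%)")
```

The four identical positions found in this way are residues 12, 13, 18 and 20.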

1.5 Secondary structure prediction of the designed sequence 

Sequences A and B were further examined by secondary structure prediction using 
the SSPAL predictor at the Sanger Centre [Salamov A. A. and Solovyev V. V. J. Mol. 
Biol. 247:11-15 (1995)], which predicts the secondary structure of a protein 
according to its primary structure (amino acid sequence). Table 3 presents 
the secondary structure of the native protein (Zif268) according to the Protein Data 
Bank (PDB), and the secondary structure predictions for A and B according to the SSPAL 
algorithm at the Sanger Centre. A was predicted to have one α-helix (designated H) and 
two β-strands (designated E) (the ββα motif), while the predicted secondary structure of 
B contained only one α-helix and one β-strand. 

Table 3 - Secondary structure of predicted primary structures A and B 

                  Residues 1-28 
SS(a), PDB        EEE TT EEE HHHHHHHHHHHH 
Zif268            KPFQCRICMRNFSRSDHLTTHIRTHTGE 
SS(a), A          EEEE EE HHHHHHHHHHH 
A                 EHMFVHHHTTRFSSHTSFTSFLRSMQGR 
SS(a), B          EEEEEE HHHHHHHH 
B                 QHMTVHFHNTTFSHHSSLSTFLQSFQGR 

(a) Secondary Structure; segments are listed in chain order. 

1.6 Molecular dynamics of the re-designed sequences 

The reduced representation of the lowest energy designed sequences, A and B, 
was expanded to an all-atom representation using the molecular mechanics package 
CHARMM. The input for this experiment was the backbone coordinates of the native 
protein, the new designed residues and the dihedral angles of each position along the 
designed sequence, derived from the rotamer with the lowest energy score. The numbers 
of atoms of A and B after expansion to all atoms were 459 and 446, respectively. 
Energy minimization was performed for the side chains of A and B as well as for the 
Zif268 side chains using the CHARMM force field, the SHAKE algorithm [Van Gunsteren 
W. F. & Berendsen H. J. C. Mol. Phys. 34:1311 (1977)], a dielectric constant of ε=1 and 
a 12 Å energy cutoff. The minimization included 200 steps of SD (Steepest Descent) and 
then an additional 500 steps of ABNR (Adopted Basis Newton Raphson). After minimization, 
each of the three structures was embedded in an 18 Å water sphere which included 
~1870 water molecules of type TIP3P [Jorgensen W. L. et al. J. Chem. Phys. 79:926-935 
(1983)]. Each of the water-protein systems of A and B was simulated for 500 ps at 
300K (with a 16 Å energy cutoff), and a sample of 2000 conformations was collected 
from the resulting molecular dynamics trajectory. It was found that the secondary 
structure of the proteins was maintained during the molecular dynamics simulations. 

The root-mean-square (rms) difference between the protein structure of the 
designed sequences before the molecular dynamics simulation and their protein 
structure after the molecular dynamics simulation was:- 

Sequence    Backbone    Side Chains    Total 
A           1.84 Å      3.05 Å         2.69 Å 
B           1.84 Å      2.71 Å         2.43 Å 

These results clearly indicate that the overall fold of the designed proteins 
remained ββα with some relaxation of backbone and side chains. 
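
By way of non-limiting illustration only, an rms difference between two conformations of the same atoms may be computed as sketched below. The sketch assumes that the two coordinate sets list the same atoms in the same order and are already expressed in a common frame; whether a superposition step preceded the values reported above is not stated, and none is performed here. The function name is hypothetical.

```python
import math


def rms_difference(coords_before, coords_after):
    """Root-mean-square distance between matched atoms, in the input units.

    Each argument is a sequence of (x, y, z) tuples listing the same atoms
    in the same order. No superposition is performed.
    """
    if len(coords_before) != len(coords_after):
        raise ValueError("both structures must contain the same atoms")
    if not coords_before:
        raise ValueError("at least one atom is required")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_before, coords_after):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_before))


# Two atoms, each displaced by 2 units along z, give an rms difference of 2.
before = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
after = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(rms_difference(before, after))
```

Backbone, side chain and total values such as those tabulated above would be obtained by passing the corresponding subsets of atoms.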

The molecular mechanics average energies of the two protein sequences during 
the simulations were:- 

A    -[...]37.4 kcal/mol 
B    -196.4 kcal/mol 

The above energy results indicate that the method and system of the present 
invention are indeed useful for the prediction of primary sequences that stabilize the 
ββα motif even in the absence of the zinc ion. 

The differences between the energy scores calculated for sequences A and B by 
the scoring function of the present invention and by the CHARMM force-field (after 
dynamics) were only 20% and 44%, respectively. These results indicate that the scoring 
function of the present invention provides a satisfactory evaluation of the potential 
energy of the designed sequences. Furthermore, with both the CHARMM force-field and the 
scoring function of the present invention, sequence A yielded a significantly more 
stable structure than sequence B. 

For comparison, the zinc bound wild-type Zif268 was also simulated using a 
system of solvated Zif268 wherein only the water sphere was allowed to move, while 
the protein coordinates were kept fixed. The system was simulated for 100 ps at 300K 
[...] dynamics trajectory. The average CHARMM [...] (Zn interaction with its four 
anchor residues [...]) [...] score by the scoring function of the present invention 
compared to its [...]. The calculation of the native protein does not include the 
contribution of the His2Cys2 Zn-binding interactions. 

Figures 4A-4C show the 3D structure of Zif268 as compared to that of the 
designed proteins [...] after minimization [...] includes hydrophobic side chains in 
[...] cysteines and two histidines [...], represented by spheres similar to the 
van der Waals radii of the atoms, indicate the good packing of the core of the 
designed sequences. 

Example 2 - Gβ1 as a target fold 

[...] streptococcal protein G (Gβ1), a 56 residue protein, was examined. Gβ1 is 
[...] with high affinity binding to the Fc region of [...], comprises six β-strands 
[...] α-helix. An extremely hyperthermophilic variant [...] 

2.1 Solvent accessibility for the Gβ1 domain 

Solvent accessibility was evaluated as described in Example 1. A comparison of the 
results obtained by the method of the present invention [...] Mayo (M&M [...] 
hereinbefore) is presented in [...] methods. In 10 of these cases the discrepancy was 
between the subtle definition of a 
site as being "buried" or "intermediate". The solvent accessibility of the eighth 
position (position 8) selected for optimization was identical in the two classification 
schemes. 

2.2 Results of Gβ1 core design 

The energy score profile for Gβ1 was obtained in the same manner as described 
for Zif268. Further, the energy score profile obtained for Gβ1 was similar to that 
obtained for Zif268 (Figs. 1 and 2), using the same temperature conditions. However, 
since only 8 out of 56 residues (14.3%) were mutated, the initial energy was already 
negative, while the final energies obtained were approximately -770 and -790 kcal/mol 
when an annealing temperature profile was used and the maximal value for the 
temperature parameter was 1000K. 

Two sets of 50 simulations were conducted. In the first set of simulations, the 
non-mutated positions were kept in their native rotameric conformation, while in the 
second set of simulations the rotameric states of the side chains of the non-mutating 
residues were allowed to change, thus providing a larger number of possible mutation 
and rotamer combinations. The total size of the search space in the first set was 
1.06x10^[...] and in the second set was 2.52x10^[...]. Each simulation in the first set was 
terminated after 10^[...] iterations with a maximal temperature of 1000K and an annealing 
periodicity of 100 MC steps. Each simulation in the second set was terminated after 
10^[...] iterations, with a maximal temperature of 1000K and an annealing periodicity of 
1000 MC steps. 
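
By way of non-limiting illustration only, the size of such a search space may be taken as the product, over all variable positions, of the number of residue and rotamer states allowed at each position. The per-position counts used below are hypothetical and are not those of the simulations described above; they serve only to show why releasing the rotamers of the non-mutated residues enlarges the space.

```python
import math


def search_space_size(states_per_position):
    """Number of distinct combinations, given the number of allowed
    amino acid/rotamer states at each variable position."""
    return math.prod(states_per_position)


# Hypothetical counts: 8 mutated positions with 100 states each.
mutated_only = [100] * 8
# Hypothetical counts: the same 8 positions plus 48 non-mutated residues
# that may each adopt one of 3 rotamers.
with_rotamer_freedom = mutated_only + [3] * 48

fixed_size = search_space_size(mutated_only)
free_size = search_space_size(with_rotamer_freedom)
print(f"fixed: 10^{math.log10(fixed_size):.1f}, free: 10^{math.log10(free_size):.1f}")
```

Because the sizes multiply, each additional variable position raises the exponent of the search space, which is consistent with the larger space and the longer annealing periodicity used in the second set.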

Table 4 presents the mutated residues in the lowest energy sequences among the 
50 simulations conducted for each set (C and D respectively). 
Table 4 - Mutated residues 

Position                 3    7    16   18   25   29   39   43   Energy score(b) (kcal/mol) 
2nd Structure(a)         E    E         E    H    H         E 
Solvent accessibility    b    b    i    i    i    i    b    i 
Gβ1                      Y    L    T    T    T    V    V    W    -676.4 
C                        L    L    L    F    L    L    V    M    -778.6 
D                        L    L    F    L    L    S    V    F    -788.8 
M&M(c)                   F    I    I    I    Q    I    I    I    [...] 

(a) E - β-sheet, H - helix; 
(b) Energy score by the scoring function of the present invention; 
(c) Sequence designed by M&M. 


2.3 Analysis of the resulting sequences 

Statistical calculations over the 50 sequences obtained in 50 simulations of 
fixed and non-fixed conformations, the latter having rotameric freedom (sets C and D, 
data not shown) provided the following observations:- 

1. In 72%-76% of the sequences buried positions 7 and 39 were found to be 
Leu and Val, respectively, as in the native protein. 

2. In most sequences Thr25 was mutated to Leu, which has a better helix 
propensity. Examination of sequences C and D by the SSPAL predictor at the 
Sanger Centre (as described hereinbefore) and by the PHD predictor at 
EMBL [Rost B. and Sander C. J. Mol. Biol. 232:584-599 (1993)] showed 
that mutations T25L and V29L maintained the secondary structure of the 
α-helix. 

3. More than 80% of the sequences contained the mutations T16L, T18L 
and T18F, which are located in a β-strand. The Leu and Phe apparently 
[...] propensity. 

4. The buried position 3 changed from Tyr mainly to Leu, which is more 
hydrophobic and is predicted to improve side chain packing in the 
interior of the protein. 

5. Four percent of the 50 sequences in the first simulation set, where the 
coordinates of the non-mutated residues were kept fixed, had Trp in 
position 43, the same as in the native protein. In the second simulation 
set, this position was predominantly assigned a Phe residue. 

The reduced representation of the designed sequence C, which was the sequence 
with the lowest energy score in the first set of simulations, where the conformation of 
the non-mutated residues was kept fixed, was expanded to its all-atom representation 
using CHARMM [Brooks et al. (1983)], in the same manner as described 
hereinbefore. In general, the information provided as input [...] 
coordinates of the native protein, the new residues and the [...] 
each position along the chain. The number of atoms in sequence C was 861. Energy 
minimization was performed for the side chains of this sequence as well as for the Gβ1 
side chains using the CHARMM force-field [MacKerell et al. [...]], a 
dielectric constant ε=1 and a 16 Å energy cutoff. The minimization included [...]00 steps 
of SD and then an additional 1100 steps of ABNR. The average energies of the sequences 
after minimization were:- 

Gβ1    -804.2 kcal/mol 
C      -822.7 kcal/mol 

The above results suggest that the mutations obtained by the methodology 
disclosed herein are tolerable, which may lead to the design of a more stable protein. 
The differences between the energies of the native sequence Gβ1 and sequence 
C, based on the present invention's scoring function and on CHARMM's force-field 
(after dynamics), were 7% and 16% respectively, which strengthens the conclusion that 
the method and system of the present invention provide a reliable tool for designing de 
novo proteins. 

The above statement is further strengthened in light of Figure 7, which shows a 
comparison of the 3D structure of Gβ1 and sequence C. 

These results show that re-designing five boundary residues and three buried 
positions in the core of the β1 domain of Streptococcal protein G was tolerable and that a 
stable, fully folded de novo protein may be obtained. 

While the foregoing description disclosed in detail only a few specific 
embodiments of the invention, it will be understood by those skilled in the art that the 
method and system of the invention are not limited to the design of these proteins. 
Further, it should be understood that other variations of the method and system of the 
invention may be possible without departing from the scope and spirit of the invention 
as herein disclosed. 



